PSTAT 100 Course Project
Andrew Hansen & William Markevitch
Data Description¶
For this project, we will be using 2 datasets provided by this course: the first being the ClimateWatch historical emissions data, and the second being from merging several life expectancy smaller datasets, mostly collected from World Bank Open Data. For the Life dataset, the result is essentially a convenience sample, but descriptive analyses without inference are nonetheless interesting and suggestive. Meanwhile, the ClimateWatch dataset reflects a representative sample of the world, as it is analyzed as a subset of a population that seeks to accurately reflect the characteristics of the world.
In regards to the emissions, we delve into the historical greenhouse gas emissions dataset sourced from Climate Watch. This dataset serves as a comprehensive record of global emissions, capturing valuable insights into the evolving landscape of human-induced climate impact over time. The timeframe of the dataset spans historical periods, allowing for an exploration of trends and patterns in emissions over time. In the long run, this comprehensive historical context allows researchers, policymakers, and the public to discuss the evolution of emissions, identify critical patterns, and assess the effectiveness of curing strategies. Researchers can leverage this dataset to analyze and communicate the urgency of addressing climate change, advocating for sustainable practices and policies. The emissions dataset contains 195 observations of countries, along with their Green House Gas emission over the years 1990 to 2019. The data brings together dozens of datasets to let users analyze and compare the Nationally Determined Contributions (NDCs) and access historical emissions data.
The Life dataset originates from The Data World Bank, from the United Nations Population Division, World Population Prospects: 2022 Revision, or derived from male and female life expectancy at birth from sources, such as Census reports and other statistical publications from national statistical offices, and Eurostat Demographic Statistics, United Nations Statistical Division, Population and Vital Statistics Report, U.S. Census Bureau: International Database, and Secretariat of the Pacific Community: Statistics and Demography. These organizations provided us with a historical in-depth dataset from the years 2000, 2010, 2015, and 2019 of 158 countries, covering the total life expectancy of their population of that year, the male and female life expectancy, GDP per capita of countries, region, sub-region, and population. The dataset is a comprehensive resource for exploring the dynamics of global population health, economic development, and demographic trends. By including data on total life expectancy, male and female life expectancy, GDP per capita, and additional socio-economic factors, this dataset offers an in-depth understanding of the intersections between health outcomes and economic factors across 158 countries. Analysts can leverage this dataset to identify patterns, disparities, and potential correlations between life expectancy, and policymakers can inspect economic well-being and demographic variables to make essential decisions on the country's future.
Question of Interest¶
While examining the datasets, we noticed an overlap in the years of observations described and collected. This prompted our interest in exploring potential patterns between the two sets. These datasets were collected independently, offering an opportunity to investigate an unintended aspect. Upon closer inspection, we formulated a question: Does the evident increase in Greenhouse Gas Emissions over the past two decades correlate with the life expectancy of people worldwide? If so, can we reasonably conclude that one is causing the other, disregarding other factors? If not, are there other explanations, like a rise in GDP or population size? We emphasized not inferring causality from our datasets, focusing on understanding correlations and interactions among the variables involved.
First we need to install (or update) all necessary libraries for this project.
%%capture # If the libraries don't properly download, remove this %%capture
!pip install numpy
!pip install pandas
!pip install altair
!pip install geopandas
!pip install "vegafusion[embed]>=1.4.0"
Now we import all these libraries and use "alt.data_transformers.enable("vegafusion")" to enable plotting of large data sets.
import numpy as np
import pandas as pd
import altair as alt
import statsmodels.api as sm
import geopandas as gpd
alt.data_transformers.enable("vegafusion")
DataTransformerRegistry.enable('vegafusion')
Data analysis¶
Our first objective was to break down the datasets one at a time, check for correlation, patterns in their respective data and draw assumptions from them to then merge them. This meant first exploring the datasets one-by-one:
Exploration of the emissions dataset¶
In this portion of the project, we introduce our emissions dataset and use it to investigate trends of emissions over time
emissions = pd.read_csv('data/historical_emissions/historical_emissions.csv')
emissions.head()
| Country | Data source | Sector | Gas | Unit | 2019 | 2018 | 2017 | 2016 | 2015 | ... | 1999 | 1998 | 1997 | 1996 | 1995 | 1994 | 1993 | 1992 | 1991 | 1990 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | World | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 49758.23 | 49368.04 | 48251.88 | 47531.68 | 46871.77 | ... | 35101.90 | 35099.21 | 35537.18 | 34179.33 | 33805.61 | 33015.04 | 32729.06 | 32588.09 | 32670.51 | 32523.58 |
| 1 | China | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 12055.41 | 11821.66 | 11385.48 | 11151.31 | 11108.86 | ... | 4028.58 | 4095.97 | 3977.65 | 3982.11 | 3960.71 | 3557.37 | 3397.80 | 3168.05 | 3039.15 | 2891.73 |
| 2 | United States | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 5771.00 | 5892.37 | 5689.61 | 5743.85 | 5665.21 | ... | 6210.12 | 6208.83 | 6160.86 | 5901.00 | 5729.69 | 5661.57 | 5567.55 | 5456.12 | 5372.08 | 5417.32 |
| 3 | India | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 3363.60 | 3360.56 | 3215.07 | 3076.48 | 3003.07 | ... | 1440.38 | 1362.33 | 1331.88 | 1272.74 | 1223.65 | 1158.48 | 1114.22 | 1081.28 | 1056.25 | 1002.56 |
| 4 | European Union (27) | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 3149.57 | 3295.53 | 3379.38 | 3364.77 | 3019.49 | ... | 3874.40 | 3949.25 | 3983.29 | 4058.46 | 3947.64 | 3894.63 | 3908.95 | 3981.08 | 4120.94 | 4187.90 |
5 rows × 35 columns
We first melt the emissions dataframe from having 20 columns of years into just one singular column called 'Year' so that we can use it in our analysis.
# Melt the dataframe so that the years is in one column only
emissions_melted_year = emissions.melt(id_vars=['Country', 'Data source', 'Sector', 'Gas', 'Unit'], var_name='Year', value_name='Emissions')
emissions_melted_year.head()
| Country | Data source | Sector | Gas | Unit | Year | Emissions | |
|---|---|---|---|---|---|---|---|
| 0 | World | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 2019 | 49758.23 |
| 1 | China | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 2019 | 12055.41 |
| 2 | United States | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 2019 | 5771.00 |
| 3 | India | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 2019 | 3363.60 |
| 4 | European Union (27) | Climate Watch | Total including LUCF | All GHG | MtCO₂e | 2019 | 3149.57 |
To get an understanding of the emission trends occuring in our data, we graph the emissions against the years for every country in our set.
# Plot all emissions of all years
emission_plot = alt.Chart(emissions_melted_year).mark_line().encode(
x=alt.X('Year:N', title='Year'),
y=alt.Y('Emissions:Q', title='Emissions'),
color='Country:N',
tooltip=['Country', 'Year', 'Emissions']
).properties(
title='Emissions Over Time for Each Country',
width=500,
height=300
)
emission_plot
In the displayed graph, we identified two outliers in our dataset, namely 'World' and 'European Union.' These entries encompass a larger collection of nations, introducing inconsistency compared to individual countries. To ensure data consistency, we opted to exclude both 'World' and 'European Union' from our melted dataset
# Remove 'World' and 'European Union' from dataset
world_col = emissions_melted_year['Country'] == 'World'
eu_col = emissions_melted_year['Country'] == "European Union (27)"
emissions_melted_year_without_world = emissions_melted_year[~world_col]
emissions_melted_year_countries_only = emissions_melted_year_without_world[~eu_col]
/tmp/ipykernel_705/586745506.py:5: UserWarning: Boolean Series key will be reindexed to match DataFrame index. emissions_melted_year_countries_only = emissions_melted_year_without_world[~eu_col]
Another way to visualize the increasing trend in emissions is to explore a graph illustrating the total emissions grouped by years, as shown below. This representation is akin to presenting the integral of our world line from the previous graph in bar chart form.
# Emissions over time
emissions_by_year = alt.Chart(emissions_melted_year_countries_only).mark_bar().encode(
x=alt.X('Year:N', title='Year'),
y=alt.Y('Emissions:Q', title='Emissions', scale=alt.Scale(zero=False)),
).properties(
title='Total Emissions Over the Years',
width=500,
height=300
)
emissions_by_year
For enhanced data visualization, we chose to generate a heatmap overlay on the world map, utilizing the geopandas library and focusing on the countries included in our exclusive dataset of individual countries.
# Calculate mean emissions
country_means = emissions_melted_year_countries_only.groupby("Country")[["Emissions"]].mean().reset_index()
# Join with geo dataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
map_data = world.merge(country_means, left_on="name", right_on="Country", how="left")
# Create Altair choropleth
choropleth = alt.Chart(map_data).mark_geoshape().encode(
color=alt.Color('Emissions:Q', scale=alt.Scale(scheme='greens')),
tooltip=['Country:N', 'Emissions:Q']
).project(
type='equirectangular'
).properties(
title = 'Mean Emissions',
width=500,
height=300
)
choropleth.configure_legend(
title = None
)
choropleth
/tmp/ipykernel_705/1801656095.py:5: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version. Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
col = df[col_name].apply(to_list_if_array, convert_dtype=False)
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version. Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
col = df[col_name].apply(to_list_if_array, convert_dtype=False)
The map highlights the absence of certain countries from the heatmap. To address this discrepancy, we will conduct a comparison between the merged datasets, specifically examining countries that differ between the two datasets. This discrepancy arises as a result of our merging process, which was based on country names.
mismatches = set(world['name']) - set(country_means['Country'])
print("Mismatches between 'name' and 'Country':", mismatches)
Mismatches between 'name' and 'Country': {'eSwatini', 'United States of America', 'Antarctica', 'Puerto Rico', 'Eq. Guinea', 'Greenland', 'Kosovo', 'New Caledonia', 'Dem. Rep. Congo', 'Bosnia and Herz.', 'Congo', 'North Macedonia', 'Czechia', 'Central African Rep.', 'Fr. S. Antarctic Lands', 'W. Sahara', 'S. Sudan', 'Falkland Is.', 'Somaliland', 'Turkey', 'N. Cyprus', 'Dominican Rep.', 'Solomon Is.', 'Taiwan', 'Palestine'}
Having identified the countries with discrepancies, we proceed to conduct a comprehensive analysis. This involves distinguishing between discrepancies arising from variations in spelling or punctuation and those resulting from countries not being included in our dataset but present in the world dataset sourced from geopandas. To rectify these issues, we utilize dictionaries and employ methods such as .replace() or filter out the necessary countries for removal.
# Calculate mean emissions
country_means = emissions_melted_year_countries_only.groupby("Country")[["Emissions"]].mean().reset_index()
# Join with geo dataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Manually standardizing names for Country mismatches
# Countries to change name
replace_dict = {
'Czechia': 'Czech Republic',
'Congo': 'Republic of Congo',
'Bosnia and Herz.': 'Bosnia and Herzegovina',
'eSwatini': 'Eswatini',
'Somaliland': 'Somalia',
'N. Cyprus': 'Cyprus',
'S. Sudan': 'South Sudan',
'Dem. Rep. Congo': 'Democratic Republic of the Congo',
'Eq. Guinea': 'Equatorial Guinea',
'Dominican Rep.': 'Dominican Republic',
'Central African Rep.': 'Central African Republic',
'Solomon Is.': 'Solomon Islands',
'United States of America': 'United States'
}
# Replace
world['name'] = world['name'].replace(replace_dict)
# Countries to drop
drop_countries = {
'New Caledonia',
'Turkey',
'Puerto Rico',
'W. Sahara',
'Taiwan',
'Falkland Is.',
'Greenland',
'Palestine',
'Fr. S. Antarctic Lands',
'Antarctica',
'Kosovo',
'North Macedonia'
}
# Filter
world = world[~world.name.isin(drop_countries)]
map_data = world.merge(country_means, left_on="name", right_on="Country", how="left")
# Create Altair choropleth
choropleth = alt.Chart(map_data).mark_geoshape().encode(
color=alt.Color('Emissions:Q', scale=alt.Scale(scheme='greens')),
tooltip=['Country:N', 'Emissions:Q']
).project(
type='equirectangular'
).properties(
title = 'Mean Emissions',
width=500,
height=300
)
choropleth.configure_legend(
title = None
)
choropleth
/tmp/ipykernel_705/2537791032.py:5: FutureWarning: The geopandas.dataset module is deprecated and will be removed in GeoPandas 1.0. You can get the original 'naturalearth_lowres' data from https://www.naturalearthdata.com/downloads/110m-cultural-vectors/.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version. Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
col = df[col_name].apply(to_list_if_array, convert_dtype=False)
/opt/conda/lib/python3.11/site-packages/altair/utils/core.py:410: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version. Do ``ser.astype(object).apply()`` instead if you want ``convert_dtype=False``.
col = df[col_name].apply(to_list_if_array, convert_dtype=False)
To ensure the map is complete, let's review the data below and confirm that there are no more differences between the two datasets.
mismatches_after = set(world['name']) - set(country_means['Country'])
print("Mismatches between 'name' and 'Country':", mismatches_after)
Mismatches between 'name' and 'Country': set()
Returning to our line plot showcasing total emissions over time for each country, we encountered a challenge. The majority of smaller emissions from various countries were clustered at the bottom of the graph. To address this, we opted to highlight the top 10 highest emitting countries while combining the rest into a single category labeled 'Others.' This approach enables us to create an aesthetically pleasing plot depicting the log-transformed emissions of the top 10 countries over the years. Furthermore, for added context, we can visualize the cumulative greenhouse gas emissions over the years for all countries, revealing a consistent upward trend with a notable pause around 1988.
# Keep all top 10 emitting countries and group the rest as 'Others'
top_10_countries = emissions_melted_year_countries_only.groupby('Country')['Emissions'].sum().nlargest(10).index
top10_em = emissions_melted_year
top10_em['Country'] = top10_em['Country'].apply(lambda x: x if x in top_10_countries else 'Other')
top10_em = top10_em.groupby(['Country','Year'])['Emissions'].mean().reset_index()
#Graph emissions over time
top10_emissions_plot = alt.Chart(top10_em).mark_line(point=True).encode(
x=alt.X('Year:N', title='Year'),
y=alt.Y('Emissions:Q',scale=alt.Scale(type='log'), title='Emissions (log)'),
color='Country:N',
tooltip=['Country', 'Year', 'Emissions']
).properties(
title='Emissions Over Time for Top Emitting Countries',
width=400,
height=300
)
top10_emissions_plot
Examining the total emissions across our time period for each of the top emitting countries, we can enhance our visualization by representing this information in a bar chart. This format allows for a clearer depiction of the significant drop-off in emissions between the highest contributors.
emissions_top_10_table = emissions_melted_year_countries_only.groupby('Country')['Emissions'].sum().nlargest(10).reset_index()
#Top 10 most Emitting countries
emission_top_countries = alt.Chart(emissions_top_10_table).mark_bar().encode(
x=alt.X('Country:N' ,sort='-y', title=''),
y=alt.Y('Emissions:Q', title='Emissions'),
).properties(
title='Top 10 Emission Countries Over the Years',
width=500,
height=300
)
emission_top_countries
Exploration of the life dataset¶
In this portion of the project, we introduce our life dataset and use it to investigate correlations population and life expectancy across regions and sub-regions
y = ['2019','2015','2010', 'Country','Data source','Sector','Unit']
emissions_intersect = emissions[emissions.columns.intersection(y)]
emissions_melted = pd.melt(emissions_intersect, id_vars=['Country', 'Data source', 'Sector', 'Unit'],
value_vars=['2010', '2015', '2019'],
var_name='Year', value_name='Emissions')
life = pd.read_csv('data/historical_emissions/life-gdp.csv')
life = life.rename(columns={"Country Name": "Country", 'region':"Region", 'All':'Life_expectancy'})
life.head()
| Country | Year | Life_expectancy | Male | Female | GDP per capita | Region | sub-region | Population | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2019 | 63.2 | 63.3 | 63.2 | 507.103432 | Asia | Southern Asia | 38041754.0 |
| 1 | Afghanistan | 2015 | 61.7 | 61.0 | 62.3 | 578.466353 | Asia | Southern Asia | 34413603.0 |
| 2 | Afghanistan | 2010 | 59.9 | 59.6 | 60.3 | 543.303042 | Asia | Southern Asia | 29185507.0 |
| 3 | Albania | 2019 | 78.0 | 76.3 | 79.9 | 5353.244856 | Europe | Southern Europe | 2854191.0 |
| 4 | Albania | 2015 | 77.8 | 76.1 | 79.7 | 3952.801215 | Europe | Southern Europe | 2880703.0 |
life_by_population = life.groupby(['Region'])['Population'].sum().reset_index()
total_population = life_by_population['Population'].sum()
life_by_population['Proportion'] = (life_by_population['Population'] / total_population)
life_by_population.sort_values(by=['Proportion'], ascending=True)
life_by_population.head()
| Region | Population | Proportion | |
|---|---|---|---|
| 0 | Africa | 3.342706e+09 | 0.139592 |
| 1 | Americas | 2.334736e+09 | 0.097499 |
| 2 | Asia | 1.550806e+10 | 0.647620 |
| 3 | Europe | 2.614412e+09 | 0.109178 |
| 4 | Oceania | 1.463373e+08 | 0.006111 |
This bar chart displays populations of sub-regions on the x-axis, with color-coded groupings representing individual regions. This provides a comprehensive view of population distribution, enabling a clear understanding of both sub-regional and regional population concentrations.
pop_subregion = life.groupby(['sub-region', 'Region'])['Population'].sum().reset_index()
region_pop = alt.Chart(pop_subregion).mark_bar().encode(
x=alt.X('sub-region:N',sort='-y', title = 'Sub-Region'),
y=alt.Y('Population:Q'),
color=alt.Color('Region:N'),
tooltip=['sub-region', 'Population'],
).properties(
width=400,
height=300,
title='Population by Region and Sub-Region'
)
region_pop
With a clearer understanding of the population distribution, let's now explore the life expectancy distribution within our dataset to gain insights into overall trends.
# Plot of all observations in the life df for life expentancy
life_expectancy = alt.Chart(life).mark_bar().encode(
x = alt.X('Life_expectancy',
bin = alt.Bin(step = 2),
title = 'Life Expectancy at Birth'),
y = alt.Y('count()', title = 'Counts')).properties(
width=500,
height=300
)
life_expectancy
Next, we present the mean life expectancy for each sub-region, accompanied by a red line parallel to the x-axis, representing the overall mean life expectancy derived from our dataset.
mean_life_by_subregion = life.groupby('sub-region')['Life_expectancy'].mean().reset_index()
# Create a bar chart
chart = alt.Chart(mean_life_by_subregion).mark_bar().encode(
x=alt.X('sub-region', sort = '-y', title = 'Sub-Region'),
y='Life_expectancy:Q'
).properties(
title='Mean Life Expectancy by Sub-Region',
width=500,
height=300
)
mean_line = alt.Chart(life).mark_rule(color='red').encode(
y=alt.Y('mean(Life_expectancy):Q', axis=alt.Axis(title='Mean Life Expectancy'))
)
chart + mean_line
The table below summarizes the worldwide average life expectancy for each recorded year, providing a glimpse of the corresponding percentage changes.
life_exp_by_region = life.groupby('Year').agg({'Life_expectancy':'mean'}).reset_index()
life_exp_by_region['Life Expectancy Percentage Change'] = (life_exp_by_region['Life_expectancy'].diff() / life_exp_by_region['Life_expectancy'].shift(1)) * 100
life_exp_by_region['Life Expectancy Percentage Change'] = life_exp_by_region['Life Expectancy Percentage Change'].map('{:.2f}%'.format)
life_exp_by_region.head()
| Year | Life_expectancy | Life Expectancy Percentage Change | |
|---|---|---|---|
| 0 | 2000 | 67.081169 | nan% |
| 1 | 2010 | 70.096815 | 4.50% |
| 2 | 2015 | 71.756410 | 2.37% |
| 3 | 2019 | 72.806536 | 1.46% |
The subsequent table displays the global mean life expectancy for each recorded year, categorized by sex, allowing for an examination of potential differences between men and women throughout the same timeframe.
life_exp_by_gender = life.groupby('Year').agg({'Male':'mean','Female':'mean'}).reset_index()
life_exp_by_gender_melt = life_exp_by_gender.melt(id_vars='Year', value_vars=['Male', 'Female'], var_name='Gender', value_name='Life Expectancy')
life_exp_by_gender['Male_Percentage_Change'] = (life_exp_by_gender['Male'].diff() / life_exp_by_gender['Male'].shift(1)) * 100
life_exp_by_gender['Female_Percentage_Change'] = (life_exp_by_gender['Female'].diff() / life_exp_by_gender['Female'].shift(1)) * 100
# Add a '%' sign to the columns
life_exp_by_gender['Male_Percentage_Change'] = life_exp_by_gender['Male_Percentage_Change'].map('{:.2f}%'.format)
life_exp_by_gender['Female_Percentage_Change'] = life_exp_by_gender['Female_Percentage_Change'].map('{:.2f}%'.format)
life_exp_by_gender.head()
| Year | Male | Female | Male_Percentage_Change | Female_Percentage_Change | |
|---|---|---|---|---|---|
| 0 | 2000 | 64.625325 | 69.629221 | nan% | nan% |
| 1 | 2010 | 67.682803 | 72.568153 | 4.73% | 4.22% |
| 2 | 2015 | 69.358974 | 74.204487 | 2.48% | 2.25% |
| 3 | 2019 | 70.436601 | 75.236601 | 1.55% | 1.39% |
Utilizing the recently created dataframe, we can visually explore the mean life expectancy for each of the four years with available data, categorized by sex, in a bar chart.
exp_gender_graph = alt.Chart(life_exp_by_gender_melt).mark_bar().encode(
x=alt.X('Gender:N', title = ''),
y=alt.Y('Life Expectancy:Q', title='Life Expectancy'),
color='Gender:N',
column=alt.Column('Year:O', title='')
).properties(
title='Mean Life Expectancy by Gender across Years',
width=100,
height=300
)
exp_gender_graph
Here, we generate a scatter plot depicting the relationship between log-transformed population and life expectancy for each country. Additionally, we've organized the countries by region and applied distinct color-coding to provide a more comprehensive visual representation.
scatter_chart = alt.Chart(life).mark_circle().encode(
x= alt.X('Life_expectancy:Q', title = 'Life Expectancy'),
y = alt.Y('Population:Q', scale=alt.Scale(type='log'), title = 'Population (log)'),
color='Region:N'
).properties(
title='Population vs Life Expectancy by Region',
width=400,
height=300
)
scatter_chart
Investigating correlations between life expectancy and GDP per capita, we depict the two variables in a plot, incorporating color-coding by region and utilizing point size to represent population size. Additionally, regression lines are incorporated for each region, culminating in the following visual representation.
#scatter plot of relation of 'Life_expectancy' and GDP per capita
scatter_gdp = alt.Chart(life).mark_circle().encode(
x=alt.X('GDP per capita:Q', scale = alt.Scale(type = 'log'), title = 'GDP per capita (log)'),
y=alt.Y('Life_expectancy:Q', title = 'Life Expectancy'),
color = 'Region:N',
size = alt.Size('Population',scale = alt.Scale(type = 'sqrt'))
).properties(
title='Life Expectancy vs GDP per Capita',
width=400,
height=400
)
loess_line = scatter_gdp.transform_loess(
on = 'GDP per capita',
loess = 'Life_expectancy',
bandwidth = 0.15
).mark_line(color='black')
reg_line = scatter_gdp.transform_regression(
groupby = ['Region'],
on = 'GDP per capita',
regression = 'Life_expectancy'
).mark_line(color = 'pink')
(scatter_gdp + reg_line).properties(width=400, height=400)
The presented table displays total GDP per capita sorted in descending order by region, providing insights into regions with the highest GDP per capita within our country pool in the life dataset.
region_gdp = life.groupby('Region')['GDP per capita'].sum().reset_index()
region_gdp.sort_values(by=['GDP per capita'], ascending=False)
| Region | GDP per capita | |
|---|---|---|
| 3 | Europe | 3.716451e+06 |
| 2 | Asia | 1.772185e+06 |
| 1 | Americas | 8.945901e+05 |
| 0 | Africa | 4.276947e+05 |
| 4 | Oceania | 3.903300e+05 |
Making a slight adjustment to the previous table, the current one presents the mean GDP per capita sorted in descending order by region. Furthermore, we have integrated the life expectancy of the regions, allowing for the observation of any disparities between the mean GDP per capita and life expectancy in each region.
region_life_expectancy = life.groupby('Region').mean(numeric_only=True).drop(columns=['Year','Male','Female',
'Population']).reset_index()
region_life_expectancy.sort_values(by=['GDP per capita'], ascending=False)
| Region | Life_expectancy | GDP per capita | |
|---|---|---|---|
| 3 | Europe | 77.602857 | 26546.081746 |
| 2 | Asia | 72.394805 | 11507.692835 |
| 4 | Oceania | 69.105556 | 10842.500132 |
| 1 | Americas | 74.040187 | 8360.655397 |
| 0 | Africa | 61.452459 | 2337.129273 |
Merging and Joint Investigation of Emissions & Life¶
In this portion of the project, we merge our emissions and life datasets in order to investigate correlations between emissions and various other variables
Initially, we merge the two datasets and filter the observations so that only those with complete records for all four years in the life dataset are incorporated into the merged dataset with emissions data.
y = ['2019','2015','2010','2000', 'Country','Data source','Sector','Unit']
emissions1 = emissions[emissions.columns.intersection(y)]
emissions_melted = pd.melt(emissions1, id_vars=['Country', 'Data source', 'Sector', 'Unit'],
value_vars=['2000','2010', '2015', '2019'],
var_name='Year', value_name='Emissions')
merged_df = pd.merge(emissions_melted, life, on='Country', how='inner')
merged_df['Year_x'] = merged_df['Year_x'].astype(int)
merged_df['Year_y'] = merged_df['Year_y'].astype(int)
merged_df = merged_df[merged_df['Year_x']==merged_df['Year_y']]
merged_df = merged_df.drop(columns = 'Year_x')
merged_df=merged_df.rename(columns = {'Year_y':"Year"})
merged_df = merged_df.drop(columns = ['Data source', 'Sector', 'Unit'])
merged_df = merged_df.groupby('Country').filter(lambda x: all(year in x['Year'].values for year in [2000, 2010, 2015, 2019]))
merged_df.groupby(['Country','Year']).sum()
merged_df.head()
| Country | Emissions | Year | Life_expectancy | Male | Female | GDP per capita | Region | sub-region | Population | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | China | 4221.08 | 2000 | 71.6 | 69.3 | 74.2 | 959.372484 | Asia | Eastern Asia | 1.262645e+09 |
| 6 | China | 9887.06 | 2010 | 74.9 | 72.3 | 77.9 | 4550.453108 | Asia | Eastern Asia | 1.337705e+09 |
| 9 | China | 11108.86 | 2015 | 76.6 | 73.9 | 79.9 | 8066.942635 | Asia | Eastern Asia | 1.371220e+09 |
| 12 | China | 12055.41 | 2019 | 77.4 | 74.7 | 80.5 | 10216.630330 | Asia | Eastern Asia | 1.397715e+09 |
| 19 | India | 1477.87 | 2000 | 62.1 | 61.3 | 62.9 | 443.314193 | Asia | Southern Asia | 1.056576e+09 |
Now, a variable representing population in millions can be generated for use in various plots.
xy = merged_df.groupby(['Country','Year'])['Emissions'].sum().reset_index()
xy['Emission_Increase'] = xy.groupby('Country')['Emissions'].diff()
xy.nlargest(10, 'Emission_Increase').reset_index()
merged_df['Population_in_millions'] = merged_df['Population'].astype(int)/1000000
merged_df.head()
| Country | Emissions | Year | Life_expectancy | Male | Female | GDP per capita | Region | sub-region | Population | Population_in_millions | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | China | 4221.08 | 2000 | 71.6 | 69.3 | 74.2 | 959.372484 | Asia | Eastern Asia | 1.262645e+09 | 1262.645000 |
| 6 | China | 9887.06 | 2010 | 74.9 | 72.3 | 77.9 | 4550.453108 | Asia | Eastern Asia | 1.337705e+09 | 1337.705000 |
| 9 | China | 11108.86 | 2015 | 76.6 | 73.9 | 79.9 | 8066.942635 | Asia | Eastern Asia | 1.371220e+09 | 1371.220000 |
| 12 | China | 12055.41 | 2019 | 77.4 | 74.7 | 80.5 | 10216.630330 | Asia | Eastern Asia | 1.397715e+09 | 1397.715000 |
| 19 | India | 1477.87 | 2000 | 62.1 | 61.3 | 62.9 | 443.314193 | Asia | Southern Asia | 1.056576e+09 | 1056.575549 |
Due to the disproportionate nature of greenhouse gas emissions worldwide, it can help us gain perspective by filtering out the top 10 countries with the highest mean emissions to view more comparable country emission values. Below we make a scatter plot with population of the non-top 10 emitting nations (in millions) versus mean emissions (square rooted).
mean_emissions = merged_df.groupby('Country')['Emissions'].mean()
# Get the top 10 countries with the highest mean emissions
top_10_countries = mean_emissions.nlargest(10).index
# Create a new table that removes the top 10 countries with the highest mean emissions
filtered_df = merged_df[~merged_df['Country'].isin(top_10_countries)]
filtered_df.head()
pop_in_millions_versus_mean_emissions_sqrt = alt.Chart(merged_df).mark_point().encode(
x = alt.X('Emissions:Q', scale = alt.Scale(type = 'sqrt', zero = False), title = 'Mean Emissions (Square Rooted)'),
y = alt.Y('Population_in_millions', title = 'Population (in millions)')
)
pop_smooth = pop_in_millions_versus_mean_emissions_sqrt.transform_loess('Emissions', 'Population_in_millions',
bandwidth = 0.25).mark_line(color = 'black')
# Combining the scatter plot and the smooth line
combined_graph = (pop_in_millions_versus_mean_emissions_sqrt + pop_smooth).properties(width=450, height=400)
combined_graph
temp = merged_df[['Country','Year','Emissions', 'Life_expectancy']].reset_index()
temp = temp.drop(columns = 'index')
top_10_emitters = temp.groupby('Country').mean().nlargest(10, 'Emissions').reset_index()
Below is a straightforward scatter plot depicting the life expectancy versus log-Emissions (scaled for clarity) for the top 10 countries, with each country color-coded:
alt.Chart(top_10_emitters, title='Life Expectancy vs Emissions').mark_circle().encode(
x=alt.X('Emissions', scale = alt.Scale(type ='log'), title = 'Emissions (log)'),
y=alt.Y('Life_expectancy', title = 'Life Expectancy'),
color='Country:N').properties(width=400, height=400)
Now we can adjust our DataFrame to make sure that no countries with negative emissions are listed, as this can't be logged for future charts. Therefore we can set all countries with negative emissions to 0.01 to work around this, and we don't lose much as there are so few countries in the negative and these negative magnitudes are statistically insignificant
# Adjusting to make sure that no countries with negative emissions are listed, as this can't be logged for future charts
merged_df_min_em = merged_df
merged_df_min_em['Emissions'] = merged_df_min_em['Emissions'].apply(lambda x: max(x, 0.01))
merged_df_min_em.head()
| Country | Emissions | Year | Life_expectancy | Male | Female | GDP per capita | Region | sub-region | Population | Population_in_millions | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | China | 4221.08 | 2000 | 71.6 | 69.3 | 74.2 | 959.372484 | Asia | Eastern Asia | 1.262645e+09 | 1262.645000 |
| 6 | China | 9887.06 | 2010 | 74.9 | 72.3 | 77.9 | 4550.453108 | Asia | Eastern Asia | 1.337705e+09 | 1337.705000 |
| 9 | China | 11108.86 | 2015 | 76.6 | 73.9 | 79.9 | 8066.942635 | Asia | Eastern Asia | 1.371220e+09 | 1371.220000 |
| 12 | China | 12055.41 | 2019 | 77.4 | 74.7 | 80.5 | 10216.630330 | Asia | Eastern Asia | 1.397715e+09 | 1397.715000 |
| 19 | India | 1477.87 | 2000 | 62.1 | 61.3 | 62.9 | 443.314193 | Asia | Southern Asia | 1.056576e+09 | 1056.575549 |
Next, we generate a scatter plot for countries outside the top 10 emitters, illustrating life expectancy against the square root of GDP per capita (scaled for clarity). Additionally, we overlay a LOESS regression line on the plot.
mean_emissions_min = merged_df_min_em.groupby('Country')['Emissions'].mean()
# Get the top 10 countries with the highest mean emissions
top_10_countries = mean_emissions_min.nlargest(20).index
# Create a new table that removes the top 10 countries with the highest mean emissions
filtered_df1 = merged_df_min_em[~merged_df['Country'].isin(top_10_countries)]
filtered_df1.head()
f_graph = alt.Chart(filtered_df1).mark_point().encode(
x = alt.X('GDP per capita:Q', scale = alt.Scale(type = 'sqrt')),
y = alt.Y('Life_expectancy',scale = alt.Scale(zero = False), title = 'Life Expectancy'))
f_smooth = f_graph.transform_loess('GDP per capita', 'Life_expectancy',
bandwidth = 0.3).mark_line(color = 'black')
(f_graph + f_smooth).properties(width=450, height=400)
In contrast, this scatter plot illustrates the life expectancy versus log emissions (scaled for clarity) for countries outside the top 10 emitters. The regions are color-coded, and the size of the points represents population size. Additionally, we include another LOESS regression line on the plot.
m = alt.Chart(merged_df_min_em).mark_circle(opacity = 0.5).encode(
x = alt.X('Emissions', scale = alt.Scale(type = 'log', zero=False)),
y = alt.Y('Life_expectancy', title = 'Life Expectancy', scale = alt.Scale(zero = False)),
color = 'Region',
size = alt.Size('Population', scale = alt.Scale(type = 'sqrt'))
)#.facet(column = 'Year')
m + m.transform_loess(
on = 'Emissions',
loess = 'Life_expectancy',
bandwidth = 0.4
).transform_calculate(
Region='"LOESS line"'
).mark_line(color='black').properties(width=400, height=400)
merged_mf = merged_df.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])
To explore the correlations between our variables, we can generate a heatmap-style correlation matrix, as depicted belo:w
corr_mx = merged_mf.corr()
# melt corr_mx
corr_mx_long = corr_mx.reset_index().rename(
columns = {'index': 'row'}
).melt(
id_vars = 'row',
var_name = 'col',
value_name = 'Correlation'
)
# construct plot
alt.Chart(corr_mx_long).mark_rect().encode(
x = alt.X('col', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
y = alt.Y('row', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
color = alt.Color('Correlation',
scale = alt.Scale(scheme = 'blueorange', # diverging gradient
domain = (-1, 1), # ensure white = 0
type = 'sqrt'), # adjust gradient scale
legend = alt.Legend(tickCount = 5)) # add ticks to colorbar at 0.5 for reference
).properties(width = 300, height = 300)
The visual display illustrates a positive correlation between GDP per capita and population, as well as between population and emissions, as emphasized in the above plot. Additionally, there are several other relationships, which we will delve into further later on.
Regression Analysis¶
To understand trends in emissions against our numerical predictors, we set up a linear regression model to analyze the trends and correlations between our variables. We first cut down our dataset into one we can use for our regression analysis, keeping only our numerical response variable: ‘Emissions’, and our numerical predictors: ‘Life_expectancy’, ‘Population’, and ‘GDP per capita’. From there, we create a Multiple Linear Regression Model containing all 3 of our predictors, then we cut it down to 2 predictors, considering every combination of predictors, and then create 3 separate simple regression models. With every model, we are able to extract the R^2 value, which captures how large or how little variance the model predicts of the response variable. The R^2 value is often used as a goodness-of-fit measure to assess how well the independent variables explain the variability in the dependent variable(s).
reg_data = merged_df.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])
reg_data = reg_data.rename(columns = {'GDP per capita':'GDP_per_capita'})
reg_data.head()
| Emissions | Life_expectancy | GDP_per_capita | Population | |
|---|---|---|---|---|
| 3 | 4221.08 | 71.6 | 959.372484 | 1.262645e+09 |
| 6 | 9887.06 | 74.9 | 4550.453108 | 1.337705e+09 |
| 9 | 11108.86 | 76.6 | 8066.942635 | 1.371220e+09 |
| 12 | 12055.41 | 77.4 | 10216.630330 | 1.397715e+09 |
| 19 | 1477.87 | 62.1 | 443.314193 | 1.056576e+09 |
# Treat emissions as variable of interest
# MLR 1
# retrieve response
y = reg_data.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['Life_expectancy', 'GDP_per_capita','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
coef_tbl = pd.DataFrame({
'estimate': rslt.params.values,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())
},
index = x.columns
)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
| estimate | standard error | |
|---|---|---|
| const | -402.176213 | 1.781174e+02 |
| Life_expectancy | 5.741325 | 2.647596e+00 |
| GDP_per_capita | 0.001480 | 1.352472e-03 |
| Population | 0.000005 | 1.249454e-07 |
| error variance | 211721.361782 | NaN |
rslt.rsquared
0.7199793960178814
#Treat emissions as variable of interest
#MLR 2
# retrieve response
y = reg_data.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['GDP_per_capita','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
coef_tbl = pd.DataFrame({
'estimate': rslt.params.values,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())
},
index = x.columns
)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
rslt.rsquared
0.7176933327761046
#Treat emissions as variable of interest
#MLR 3
# retrieve response
y = reg_data.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['Life_expectancy','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
coef_tbl = pd.DataFrame({
'estimate': rslt.params.values,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())
},
index = x.columns
)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
rslt.rsquared
0.719397247629431
#Treat emissions as variable of interest
#MLR 4
# retrieve response
y = reg_data.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.loc[:, ['Life_expectancy','GDP_per_capita']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
coef_tbl = pd.DataFrame({
'estimate': rslt.params.values,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())
},
index = x.columns
)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
rslt.rsquared
0.012482771328017228
#SLR (population)
# retrieve response
y = reg_data.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.Population)
# print first five rows of x
slr = sm.OLS(endog = y, exog = x)
rslt = slr.fit()
coef_tbl = pd.DataFrame({'estimate': rslt.params.values,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())},
index = x.columns)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
print(coef_tbl)
print('-------')
print('R^2 = ',rslt.rsquared)
estimate standard error const 20.110843 1.991323e+01 Population 0.000005 1.255190e-07 error variance 215918.570765 NaN ------- R^2 = 0.7134366397152028
#SLR (Life Expectancy)
# retrieve response
y = reg_data.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.Life_expectancy)
# print first five rows of x
slr = sm.OLS(endog = y, exog = x)
rslt = slr.fit()
coef_tbl = pd.DataFrame({'estimate': rslt.params.values,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())},
index = x.columns)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
print(coef_tbl)
print('-------')
print('R^2 = ',rslt.rsquared)
estimate standard error const -432.766747 284.093073 Life_expectancy 9.082499 4.001055 error variance 746817.775677 NaN ------- R^2 = 0.008836476822486672
#SLR (GDP_per_capita)
# retrieve response
y = reg_data.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_data.GDP_per_capita)
# print first five rows of x
slr = sm.OLS(endog = y, exog = x)
rslt = slr.fit()
coef_tbl = pd.DataFrame({'estimate': rslt.params.values,
'standard error': np.sqrt(rslt.cov_params().values.diagonal())},
index = x.columns)
coef_tbl.loc['error variance', 'estimate'] = rslt.scale
coef_tbl
print(coef_tbl)
print('-------')
print('R^2 = ',rslt.rsquared)
estimate standard error const 202.964997 43.575563 GDP_per_capita 0.000335 0.002048 error variance 753441.041101 NaN ------- R^2 = 4.619449856890778e-05
After finding all R^2 values, we can compare them side by side in a neatly formatted table.
Model = ['MLR', 'MLR','MLR','MLR','SLR','SLR','SLR']
Num_predictors = [3, 2, 2, 2, 1, 1, 1, 1]
Predictor_names = ['Life expectancy + GDP per capita + Population',
'GDP_per_capita + Population',
'Life_expectancy + Population',
'Life_expectancy + GDP_per_capita',
'Population','Life expectancy','GDP per capita']
R_sqr_val = [0.7199793960178814,
0.7176933327761046, 0.719397247629431,0.012482771328017228,
0.7134366397152028, 0.008836476822486672,4.619449856890778e-05 ]
data = {
'Model': ['MLR', 'MLR', 'MLR', 'MLR', 'SLR', 'SLR', 'SLR'],
'Num_predictors': [3, 2, 2, 2, 1, 1, 1],
'Predictor_names': ['Life expectancy + GDP per capita + Population',
'GDP_per_capita + Population',
'Life_expectancy + Population',
'Life_expectancy + GDP_per_capita',
'Population', 'Life expectancy', 'GDP per capita'],
'R_sqr_val': [0.7199793960178814,
0.7176933327761046, 0.719397247629431, 0.012482771328017228,
0.7134366397152028, 0.008836476822486672, 4.619449856890778e-05]
}
Lin_reg = pd.DataFrame(data)
Lin_reg.style.set_caption("Linear Regression for Emissions vs Predictors")
| Model | Num_predictors | Predictor_names | R_sqr_val | |
|---|---|---|---|---|
| 0 | MLR | 3 | Life expectancy + GDP per capita + Population | 0.719979 |
| 1 | MLR | 2 | GDP_per_capita + Population | 0.717693 |
| 2 | MLR | 2 | Life_expectancy + Population | 0.719397 |
| 3 | MLR | 2 | Life_expectancy + GDP_per_capita | 0.012483 |
| 4 | SLR | 1 | Population | 0.713437 |
| 5 | SLR | 1 | Life expectancy | 0.008836 |
| 6 | SLR | 1 | GDP per capita | 0.000046 |
Regression for Top 10 Emission Countries¶
When graphing the top 10 countries that produce the greatest amount of green house gases, we thought that maybe they produce a lot of weight into the results that are given in the analysis of the main dataset. To get a better understanding of how these smaller countries compare to the larger one, we thought to recreate the linear regression analysis on split datasets with the top 10 emitting countries against the others.
First, we make a list of the top 10 countries with the highest emission values using the mean of all the years, and use indexing to create a dataset of only the top 10 countries, and another dataset without them, and group them by year and country.
top_10_countries = merged_df.groupby('Country')['Emissions'].mean().nlargest(10).index
emissions_top_10 = merged_df[merged_df['Country'].isin(top_10_countries)]
emissions_without_top_10 = merged_df[~merged_df['Country'].isin(top_10_countries)]
emissions_top_10=emissions_top_10.groupby(['Country','Year']).sum().reset_index()
emissions_top_10.head()
| Country | Year | Emissions | Life_expectancy | Male | Female | GDP per capita | Region | sub-region | Population | Population_in_millions | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australia | 2000 | 650.73 | 79.7 | 77.1 | 82.2 | 21679.24784 | Oceania | Australia and New Zealand | 19153000.0 | 19.153000 |
| 1 | Australia | 2010 | 596.36 | 81.9 | 79.8 | 84.0 | 52022.12560 | Oceania | Australia and New Zealand | 22031750.0 | 22.031750 |
| 2 | Australia | 2015 | 568.87 | 82.3 | 80.4 | 84.2 | 56755.72171 | Oceania | Australia and New Zealand | 23815995.0 | 23.815995 |
| 3 | Australia | 2019 | 608.49 | 83.0 | 81.3 | 84.8 | 55060.32610 | Oceania | Australia and New Zealand | 25364307.0 | 25.364307 |
| 4 | Brazil | 2000 | 1809.18 | 71.5 | 67.9 | 75.2 | 3749.75325 | Americas | Latin America and the Caribbean | 174790340.0 | 174.790340 |
emissions_without_top_10.head()
| Country | Emissions | Year | Life_expectancy | Male | Female | GDP per capita | Region | sub-region | Population | Population_in_millions | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 163 | South Africa | 380.86 | 2000 | 55.8 | 53.1 | 58.6 | 3032.427138 | Africa | Sub-Saharan Africa | 44967708.0 | 44.967708 |
| 166 | South Africa | 540.67 | 2010 | 57.1 | 55.0 | 59.2 | 7328.615629 | Africa | Sub-Saharan Africa | 51216964.0 | 51.216964 |
| 169 | South Africa | 542.51 | 2015 | 62.6 | 59.8 | 65.3 | 5734.633629 | Africa | Sub-Saharan Africa | 55386367.0 | 55.386367 |
| 172 | South Africa | 562.19 | 2019 | 65.3 | 62.2 | 68.3 | 6001.400814 | Africa | Sub-Saharan Africa | 58558270.0 | 58.558270 |
| 179 | Pakistan | 240.86 | 2000 | 60.1 | 59.3 | 61.0 | 576.195601 | Asia | Southern Asia | 142343578.0 | 142.343578 |
To put our theory into practice, we create visualizations of different comparisons and trends between the variables in our dataset. We compare the life expectancy against emissions to evaluate how greater emission rates affect the life expectancy of these higher emission countries versus the smaller, and include a linear trend to display the pattern.
#2
emissions_top_10_graph = alt.Chart(emissions_top_10).mark_circle().encode(
x=alt.X('Emissions:Q', scale = alt.Scale(type='sqrt', zero=False)),
y='Life_expectancy:Q',
color='Country:N',
size=alt.Size('Population:Q', scale = alt.Scale(type = 'sqrt'))
).properties(width=400, height=300)
t10 = emissions_top_10_graph.transform_regression('Emissions', 'Life_expectancy').mark_line(color = 'black')
t10em = emissions_top_10_graph + t10
emissions_not_top_10_graph = alt.Chart(emissions_without_top_10).mark_circle().encode(
x=alt.X('Emissions:Q', scale = alt.Scale(type='sqrt', zero=False)),
y='Life_expectancy:Q',
color='Country:N',
size=alt.Size('Population:Q', scale = alt.Scale(type = 'sqrt'))
).properties(width=400, height=300)
nt10 = emissions_not_top_10_graph.transform_regression('Emissions', 'Life_expectancy').mark_line(color = 'black')
nt10em = emissions_not_top_10_graph + nt10
alt.vconcat(t10em, nt10em)
Next, we depict the relationship between emissions against GDP per capita, plotting each point of each country, finding the mean GDP per capita over the years and observing how it affects the emissions output.
emissions_top_10_gdp_graph = alt.Chart(emissions_top_10).mark_circle().encode(
x='GDP per capita:Q',
y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
color='Country:N',
tooltip=['Country', 'Year', 'GDP per capita', 'Emissions', 'Population']
).properties(width=400, height=300)
bline = emissions_top_10_gdp_graph.transform_loess('GDP per capita', 'Emissions',
bandwidth = 0.45).mark_line(color = 'black')
emissions_not_top_10_gdp_graph = alt.Chart(emissions_without_top_10).mark_circle().encode(
x='GDP per capita:Q',
y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
color='Country:N',
tooltip=['Country', 'Year', 'GDP per capita', 'Emissions', 'Population']
).properties(width=400, height=300)
eline = emissions_not_top_10_gdp_graph.transform_loess('GDP per capita', 'Emissions',
bandwidth = 0.45).mark_line(color = 'black')
alt.vconcat(emissions_top_10_gdp_graph+bline, emissions_not_top_10_gdp_graph + eline)
In addition, we plot emissions against population, using a logarithmic scale on the emissions to incorporate equally the smaller emissions from the smaller countries.
xtv = alt.Chart(emissions_top_10).mark_circle().encode(
x='Population:Q',
y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
color='Country:N',
).properties(width=400, height=300)
ytv = alt.Chart(emissions_without_top_10).mark_circle().encode(
x='Population:Q',
y=alt.Y('Emissions:Q', scale = alt.Scale(type = 'log', zero=False)),
color='Country:N',
).properties(width=400, height=300)
vline = xtv.transform_loess('Population', 'Emissions',
bandwidth = 0.45).mark_line(color = 'black')
vreg = ytv.transform_regression('Population', 'Emissions').mark_line(color = 'black')
yline = ytv.transform_loess('Population', 'Emissions',
bandwidth = 0.45).mark_line(color = 'black')
yreg = ytv.transform_regression('Population', 'Emissions').mark_line(color = 'black')
alt.vconcat(xtv+vreg, ytv+yreg)
As these visualizations narrate the variations in how emissions are influenced in the top countries compared to the smaller ones, we proceed with another step in linear regression. Utilizing our segregated data, we exclude all non-numerical data, retaining only Emissions, Life Expectancy, Population, and GDP per capita. Subsequently, we conduct multiple linear regression (MLR) and simple linear regression (SLR) for both sets of high and low emission countries, presenting the results in tables at the conclusion, inclusive of our R^2 values.
reg_t10 = emissions_top_10.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])
reg_t10 = reg_t10.rename(columns = {'GDP per capita':'GDP_per_capita'})
reg_nt10 = emissions_without_top_10.drop(columns=['Country','Year','Male', 'Female','Region','sub-region','Population_in_millions'])
reg_nt10 = reg_nt10.rename(columns = {'GDP per capita':'GDP_per_capita'})
# Treat emissions as variable of interest
# MLR 1
# retrieve response
y = reg_t10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['Life_expectancy', 'GDP_per_capita','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.7413651601135549
#MLR 2
# retrieve response
y = reg_t10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['Life_expectancy', 'Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.6878591169919026
#Treat emissions as variable of interest
#MLR 3
# retrieve response
y = reg_t10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['Life_expectancy', 'GDP_per_capita']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.24329757166908683
#Treat emissions as variable of interest
#MLR 4
# retrieve response
y = reg_t10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.loc[:, ['GDP_per_capita','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.6223115523941747
#Treat emissions as variable of interest
#SLR 1
# retrieve response
y = reg_t10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.Life_expectancy)
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.014584777637709712
#Treat emissions as variable of interest
#SLR 2
# retrieve response
y = reg_t10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.Population)
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.6140040826069113
#Treat emissions as variable of interest
#SLR 3
# retrieve response
y = reg_t10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_t10.GDP_per_capita)
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.11989222540865618
Model = ['MLR', 'MLR','MLR','MLR','SLR','SLR','SLR']
Num_predictors = [3, 2, 2, 2, 1, 1, 1, 1]
Predictor_names = ['Life expectancy + GDP per capita + Population',
'GDP_per_capita + Population',
'Life_expectancy + Population',
'Life_expectancy + GDP_per_capita',
'Population','Life expectancy','GDP per capita']
R_sqr_val_t10 = [0.7413651601135549,
0.6223115523941747, 0.6878591169919026,0.24329757166908683,
0.6140040826069113, 0.014584777637709712,0.11989222540865618 ]
data_t10= {
'Model': ['MLR', 'MLR', 'MLR', 'MLR', 'SLR', 'SLR', 'SLR'],
'Num_predictors': [3, 2, 2, 2, 1, 1, 1],
'Predictor_names': ['Life expectancy + GDP per capita + Population',
'GDP_per_capita + Population',
'Life_expectancy + Population',
'Life_expectancy + GDP_per_capita',
'Population', 'Life expectancy', 'GDP per capita'],
'R^2 of Top 10 Emission Countries': [0.7413651601135549,
0.6223115523941747, 0.6878591169919026,0.24329757166908683,
0.6140040826069113, 0.014584777637709712,0.11989222540865618 ]
}
Regression for Remaining Emission Countries¶
The procedure for this is identical to the one outlined above.
#Treat emissions as variable of interest
#MLR 1
# retrieve response
y = reg_nt10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['Life_expectancy', 'GDP_per_capita','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.4978334100816997
#Treat emissions as variable of interest
#MLR 2
# retrieve response
y = reg_nt10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['Life_expectancy','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.4943751865686107
#Treat emissions as variable of interest
#MLR 3
# retrieve response
y = reg_nt10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['Life_expectancy', 'GDP_per_capita']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.01932313749396042
#Treat emissions as variable of interest
#MLR 4
# retrieve response
y = reg_nt10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.loc[:, ['GDP_per_capita','Population']])
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.48446928491764774
#Treat emissions as variable of interest
#SLR 1
# retrieve response
y = reg_nt10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.Life_expectancy)
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.018829484767021176
#Treat emissions as variable of interest
#SLR 2
# retrieve response
y = reg_nt10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.Population)
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.4614267534668359
#Treat emissions as variable of interest
#SLR 3
# retrieve response
y = reg_nt10.Emissions
# construct explanatory variable matrix
x = sm.tools.add_constant(reg_nt10.GDP_per_capita)
# print first five rows of x
mlr = sm.OLS(endog = y, exog = x)
# store results
rslt = mlr.fit()
rslt.rsquared
0.0036404170398194324
Model = ['MLR', 'MLR','MLR','MLR','SLR','SLR','SLR']
Num_predictors = [3, 2, 2, 2, 1, 1, 1, 1]
Predictor_names = ['Life expectancy + GDP per capita + Population',
'GDP_per_capita + Population',
'Life_expectancy + Population',
'Life_expectancy + GDP_per_capita',
'Population','Life expectancy','GDP per capita']
R_sqr_val_nt10 = [0.49783341008169957,
0.4844692849176476, 0.4943751865686107,0.019323137493960307,
0.4614267534668358, 0.018829484767021065,0.0036404170398192104]
data_nt10= {
'Model': ['MLR', 'MLR', 'MLR', 'MLR', 'SLR', 'SLR', 'SLR'],
'Num_predictors': [3, 2, 2, 2, 1, 1, 1],
'Predictor_names': ['Life expectancy + GDP per capita + Population',
'GDP_per_capita + Population',
'Life_expectancy + Population',
'Life_expectancy + GDP_per_capita',
'Population', 'Life expectancy', 'GDP per capita'],
'R^2 of Other Countries': [0.49783341008169957,
0.4844692849176476, 0.4943751865686107,0.019323137493960307,
0.4614267534668358, 0.018829484767021065,0.0036404170398192104]
}
Lin_reg_top_10_emission = pd.DataFrame(data_t10)
Lin_reg_top_10_emission.style.set_caption("Linear Regression for Emissions vs Predictors of Top 10 Emission Countries")
| Model | Num_predictors | Predictor_names | R^2 of Top 10 Emission Countries | |
|---|---|---|---|---|
| 0 | MLR | 3 | Life expectancy + GDP per capita + Population | 0.741365 |
| 1 | MLR | 2 | GDP_per_capita + Population | 0.622312 |
| 2 | MLR | 2 | Life_expectancy + Population | 0.687859 |
| 3 | MLR | 2 | Life_expectancy + GDP_per_capita | 0.243298 |
| 4 | SLR | 1 | Population | 0.614004 |
| 5 | SLR | 1 | Life expectancy | 0.014585 |
| 6 | SLR | 1 | GDP per capita | 0.119892 |
Lin_reg_not_top_10_emission = pd.DataFrame(data_nt10)
Lin_reg_not_top_10_emission.style.set_caption("Linear Regression for Emissions vs Predictors of Other Countries")
| Model | Num_predictors | Predictor_names | R^2 of Other Countries | |
|---|---|---|---|---|
| 0 | MLR | 3 | Life expectancy + GDP per capita + Population | 0.497833 |
| 1 | MLR | 2 | GDP_per_capita + Population | 0.484469 |
| 2 | MLR | 2 | Life_expectancy + Population | 0.494375 |
| 3 | MLR | 2 | Life_expectancy + GDP_per_capita | 0.019323 |
| 4 | SLR | 1 | Population | 0.461427 |
| 5 | SLR | 1 | Life expectancy | 0.018829 |
| 6 | SLR | 1 | GDP per capita | 0.003640 |
Summary of Findings¶
We delve into the relationships between the variables in question in our datasets, emissions, and life to determine how one might affect the other. Our initial observations reveal a notable surge in greenhouse gas emissions globally, with the most significant increase occurring in the early 2000s. Despite attempts to identify geographic patterns, emissions do not exhibit an apparent spillover across national borders, except for discernible disparities in emission levels among regions.
When we checked the emissions data, it was clear that some countries had much higher emission values than the rest. To understand this better, we took a closer look at the top 10 countries with the highest emissions and grouped the others into a category we called "Other." Plotting the yearly emissions after this grouping showed us how much impact the top emitters had compared to the remaining 185 countries. The resulting chart made us question whether the overall emissions results represented the smaller countries accurately. Later on, we will specifically analyze this difference via our regression analyses.
Turning to the life dataset, we observed a higher population count in the Asian region, totaling 1.55 trillion people and constituting 65% of the dataset's overall population. Consequently, it is reasonable to anticipate that findings from the Asian region will significantly influence the dataset's overall outcomes. The average life expectancy of 72.45 years showed a proportional increase of 0.27 years over the 20-year dataset period. A gender-based analysis revealed an intriguing trend where male life expectancy rose quicker than females, with an increase of 5.8 years for males versus 5.6 years for females. Notably, the percentage increase in life expectancy tends to diminish over the years, declining from a 2.37% rise between 2010 and 2015 to a 1.46% increase from 2015 to 2019. This decline may be associated with the increasing levels of greenhouse gas emissions. In our investigation, we explored external online sources referenced in the paragraph. Subsequent research highlighted that fossil fuels such as coal, oil, and gas significantly contribute to global climate change, accounting for over 75 percent of global greenhouse gas emissions and nearly 90 percent of all carbon dioxide emissions. Understanding the origin of these emissions and how fossil fuels enter the atmosphere is crucial. Information from www.un.org indicates that critical contributors to greenhouse gas emissions include power generation, manufacturing goods, deforestation, and transportation – all indicative of a thriving economy. Consequently, we undertook a comparative analysis, juxtaposing Gross Domestic Product (GDP) with life expectancy and examining GDP trends across various regions. This exploration revealed a positive correlation, demonstrating an increase in GDP corresponding to elevated life expectancy in each region.
Combining the life and emissions datasets facilitated a direct comparison between life expectancy and emissions, visualized on a shared graph. After organizing and refining the data, a distinct positive correlation emerged, indicating that higher life expectancy among a country's citizens tends to align with increased greenhouse gas emissions. This correlation is further associated with the observed rise in life expectancy concurrent with increased GDP per capita. These findings suggest a connection between emissions growth, GDP expansion, and prolonged life expectancy. A heatmap illustrating the numerical relationships within our dataset was employed to represent these correlations visually.
We claimed that these factors contribute to emission increases, so we conducted several linear regression models to understand which combinations of variables influence emissions. Following these analyses, the multiple linear regression incorporating all three numeric variables—population, life expectancy, and GDP—yielded the highest R^2 value, indicating that the model explains approximately 80% of the data points through the regression line. In simple linear regression and individual variable analysis, population emerged as the most significant predictor when forecasting emissions.
As mentioned earlier, we highlighted that the top 10 emitting countries significantly influence overall trends in emission production, life expectancy, and population growth. When analyzing these countries separately from the least emitting ones, distinct patterns emerge. Firstly, regarding the correlation between emissions and life expectancy, heavy-emitting countries exhibit a negative trend in life expectancy. In contrast, low-emitting countries show a positive correlation—life expectancy increases with rising emissions. Similarly, when comparing GDP to emissions, high-emitting countries demonstrate a negative trend as GDP increases, which is noteworthy. Conversely, low-emitting countries display a somewhat different trend, where very low GDP initially leads to increased emissions, then trends negatively as their total GDP approaches the mean. Lastly, comparing population and emissions reveals a sharp increase in emissions with population growth in high-emitting countries, while low-emitting countries experience a more gradual rise in emissions.
With this new information, we conducted linear regression on the divided dataset for high and low-emitting countries to identify key factors. In the case of high emitting countries, after running various regressions, the multiple linear regression, encompassing all three numeric variables (population, life expectancy, and GDP), produced the highest R^2 value, indicating that the model explains approximately 74% of the data points along the regression line. Turning to low-emitting countries, both models—utilizing three predictors and two predictors (life expectancy and population)—exhibited similar explanatory variability, ranging between 49% and 50%. When considering simple linear regression and evaluating each variable independently, population emerged as the most significant predictor for emissions in low and high-emitting countries.
In summary, our thorough analysis has provided a conclusive response to our question of interest concerning emissions and life expectancy. Despite variations in magnitude across different regions, we observed a consistent positive correlation between life expectancy and emissions. This alignment is not surprising, considering the influence of additional factors in our datasets. Population value also positively correlates with emissions, interconnected with GDP per capita. Our linear regression underscores the importance of considering all numerical variables for accurate emission predictions, emphasizing the complexity of relying solely on a single factor. It intuitively makes sense that emissions and GDP are positively correlated, as economic activities like manufacturing and deforestation contribute to both. Moreover, a higher population in a region leads to increased transportation and greenhouse gas emissions, and a rising GDP indicates improved healthcare practices, contributing to an extended life expectancy in a flourishing economy.